# Multimodal Robot Control
Models currently listed under this category:

| Model | License | Description | Tags | Organization | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| STEVE-R1-7B-SFT-GGUF | Apache-2.0 | Static quantized version of STEVE-R1-7B-SFT, offering multiple quantization levels for different hardware requirements (loading sketch below). | Text-to-Image, English | mradermacher | 203 | 0 |
| MiniVLA VQ Bridge Prismatic | MIT | A more compact yet higher-performing vision-language-action model, compatible with the Prismatic VLMs codebase. | Image-to-Text, Transformers, English | Stanford-ILIAD | 22 | 0 |
| RDT-170M | MIT | A 170-million-parameter imitation-learning diffusion Transformer for robot vision-language-action tasks. | Multimodal Fusion, Transformers, English | robotics-diffusion-transformer | 278 | 7 |
| RDT-1B | MIT | A 1-billion-parameter imitation-learning diffusion Transformer pretrained on 1M+ multi-robot operation episodes, supporting multi-view vision-language-action prediction. | Multimodal Fusion, Transformers, English | robotics-diffusion-transformer | 2,644 | 80 |
| Octo Small | MIT | A robot control model trained with a diffusion policy that predicts 7-dimensional actions for the next 4 steps; suited to multi-source robot datasets (usage sketch below). | Multimodal Fusion, Transformers | rail-berkeley | 335 | 13 |
| Octo Base | MIT | A robot control foundation model trained with a diffusion policy, predicting future actions from multimodal inputs. | Multimodal Fusion, Transformers | rail-berkeley | 215 | 21 |
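For the GGUF entry, a minimal loading sketch: the repo id, quant filename, and prompt below are assumptions for illustration (the listing does not give exact file names), and they presume the quantized files run under the llama-cpp-python bindings.

```python
# A minimal sketch: repo id and filename are assumed and may not match the
# actual quant files; check the model page for the real names.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Q4_K_M is a common mid-size quant level; smaller levels (e.g. Q2_K) trade
# accuracy for a lower memory footprint, larger ones (Q6_K, Q8_0) the reverse.
model_path = hf_hub_download(
    repo_id="mradermacher/STEVE-R1-7B-SFT-GGUF",   # assumed repo id
    filename="STEVE-R1-7B-SFT.Q4_K_M.gguf",        # assumed filename pattern
)

llm = Llama(model_path=model_path, n_ctx=4096)
out = llm("Plan the next step for the robot:", max_tokens=64)
print(out["choices"][0]["text"])
```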
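For the Octo models, a minimal usage sketch based on the interface documented in the rail-berkeley octo codebase (`OctoModel.load_pretrained`, `create_tasks`, `sample_actions`); the observation key names, image resolution, and checkpoint path are assumptions that vary across Octo releases, so treat this as a sketch rather than a definitive call sequence.

```python
# A minimal sketch following the octo repo's documented interface; observation
# keys ("image_primary", "timestep_pad_mask") and image size differ across
# Octo releases, and the dummy zero image stands in for a real camera frame.
import jax
import numpy as np
from octo.model.octo_model import OctoModel

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small")

# One third-person camera frame with an observation-history window of 1.
observation = {
    "image_primary": np.zeros((1, 1, 256, 256, 3), dtype=np.uint8),  # placeholder frame
    "timestep_pad_mask": np.array([[True]]),
}
task = model.create_tasks(texts=["pick up the red block"])

# Per the listing, the model predicts a chunk of 7-dimensional actions for the
# next 4 steps, so the sampled array should have shape (batch, 4, 7).
actions = model.sample_actions(observation, task, rng=jax.random.PRNGKey(0))
print(actions.shape)
```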